%load_ext pretty_jupyter
from pyspark.sql import SparkSession

from pyspark.sql.window import Window
from pyspark.sql.functions import col, min, max, lead, lag, lit, date_format

import pandas as pd

RDDs (Resilient Distributed Datasets):

a. Transformations vs Actions

  1. Transformations: PySpark operations that produce a new RDD, DataFrame, or Dataset as their result.

Transformations are lazy operations, meaning none of them are executed until you call an action on the RDD.

1.1 Narrow Transformations:

  • Each input partition is transformed into exactly one output partition: every partition of the child RDD depends on at most one partition of the parent RDD.
  • These transformations are generally fast because they require no data shuffling or movement across the cluster network.
  • in RDD
    1. map()
    2. filter()
    3. flatMap()
    4. sample()
    5. union()

  • in DataFrame
    1. select()
    2. filter() or where()
    3. withColumn()
    4. drop()
    5. alias()
    6. sample()